Term Weighting Schemes for Latent Dirichlet Allocation

Authors

  • Andrew T. Wilson
  • Peter A. Chew
Abstract

Many implementations of Latent Dirichlet Allocation (LDA), including those described in Blei et al. (2003), rely at some point on the removal of stopwords, words which are assumed to contribute little to the meaning of the text. This step is considered necessary because otherwise high-frequency words tend to end up scattered across many of the latent topics without much rhyme or reason. We show, however, that the ‘problem’ of high-frequency words can be dealt with more elegantly, and in a way that to our knowledge has not been considered in LDA, through the use of appropriate weighting schemes comparable to those sometimes used in Latent Semantic Indexing (LSI). Our proposed weighting methods not only make theoretical sense, but can also be shown to improve precision significantly on a non-trivial cross-language retrieval task.
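The idea of down-weighting high-frequency words instead of deleting them can be illustrated with log-entropy weighting, one scheme commonly paired with LSI. The abstract does not say which scheme the authors adopt or how weights enter LDA inference, so the following is only a minimal sketch under that assumption; the function name `log_entropy_weights` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def log_entropy_weights(docs):
    """Return {term: global weight in [0, 1]} for a list of tokenized documents.

    Illustrative log-entropy global weight (an assumption, not necessarily
    the paper's scheme): g(t) = 1 + sum_d p(t,d) * log p(t,d) / log(N),
    where p(t,d) = tf(t,d) / total count of t, and N = number of documents.
    """
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("need at least two documents")
    counts = [Counter(doc) for doc in docs]
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)

    weights = {}
    for term, gf in global_freq.items():
        entropy_sum = 0.0
        for c in counts:
            tf = c.get(term, 0)
            if tf:
                p = tf / gf
                entropy_sum += p * math.log(p)
        weights[term] = 1.0 + entropy_sum / math.log(n_docs)
    return weights


# Hypothetical toy corpus; "the" plays the role of a stopword.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
weights = log_entropy_weights(docs)
for term in ("the", "cat", "market"):
    print(term, round(weights[term], 3))
```

In this toy corpus, a word spread across every document ("the") receives a weight near zero, while a word confined to one document ("market") receives the maximum weight of 1, so no explicit stopword list is needed to suppress the former.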


Similar resources

Novel weighting scheme for unsupervised language model adaptation using latent dirichlet allocation

A new approach for computing weights of topic models in language model (LM) adaptation is introduced. We formed topic clusters by a hard-clustering method assigning one topic to one document based on the maximum number of words chosen from a topic for that document in Latent Dirichlet Allocation (LDA) analysis. The new weighting idea is that the unigram count of the topic generated by hard-clus...


FFTM: A Fuzzy Feature Transformation Method for Medical Documents

The vast array of medical text data represents a valuable resource that can be analyzed to advance the state of the art in medicine. Currently, text mining methods are being used to analyze medical research and clinical text data. Some of the main challenges in text analysis are high dimensionality and noisy data. There is a need to develop novel feature transformation methods that help reduce ...


Robust audio-codebooks for large-scale event detection in consumer videos

In this paper we present our audio-based system for detecting “events” within consumer videos (e.g. YouTube) and report our experiments on the TRECVID Multimedia Event Detection (MED) task and development data. Codebook or bag-of-words models have been widely used in text, visual and audio domains and form the state of the art in MED tasks. The overall effectiveness of these models on such dat...


LDA-Based Industry Classification (Research-in-Progress)

Industry classification is a crucial step for financial analysis. However, existing industry classification schemes have several limitations. In order to overcome these limitations, in this paper, we propose an industry classification methodology on the basis of business commonalities using the topic features learned by the Latent Dirichlet Allocation (LDA) from firms’ business descriptions. Tw...


Discussion of "The Discrete Infinite Logistic Normal Distribution for Mixed-Membership Modeling"

Mixed-membership models (e.g. “topic models”) are inarguably popular; especially latent Dirichlet allocation (LDA) [Blei et al., 2003] and its variants. Such models have become a fundamental tool in the analysis and exploration of many types of data. Originally designed to model text documents as per-word draws from a document-specific weighting of a finite collection of “topics” (distributions...



Publication date: 2010